Reinforcement learning is widely used in applications where one needs to perform sequential decision-making while interacting with the environment. The problem becomes more challenging when the decisions must also satisfy safety constraints. The problem is mathematically formulated as a constrained Markov decision process (CMDP). In the literature, various algorithms solve the CMDP problem in a model-free manner to achieve an $\epsilon$-optimal cumulative reward with an $\epsilon$-feasible policy. An $\epsilon$-feasible policy implies that it may violate the constraints by up to $\epsilon$. An important question here is whether we can achieve an $\epsilon$-optimal cumulative reward with zero constraint violation. To achieve that, we advocate the use of a stochastic primal-dual approach to solving the CMDP problem and propose the Conservative Stochastic Primal-Dual Algorithm (CSPDA), which is shown to exhibit $\tilde{\mathcal{O}}\left(1/\epsilon^2\right)$ sample complexity for achieving an $\epsilon$-optimal cumulative reward with zero constraint violation. In prior work, the best available sample complexity for an $\epsilon$-optimal policy with zero constraint violation is $\tilde{\mathcal{O}}\left(1/\epsilon^5\right)$. Hence, the proposed algorithm provides a significant improvement over the state of the art.
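To make the primal-dual idea concrete, the following is a minimal sketch of a generic conservative stochastic primal-dual update for a tabular CMDP. It is not the paper's CSPDA: the toy environment, the conservatism margin `kappa`, and the step sizes are all hypothetical choices, and the policy gradient is a plain REINFORCE estimate.

```python
# Minimal sketch, assuming a toy finite CMDP with one cost constraint.
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 5, 3, 20                           # states, actions, episode horizon
P = rng.dirichlet(np.ones(S), size=(S, A))   # toy transition kernel P[s, a] -> next-state dist
r = rng.uniform(size=(S, A))                 # reward
c = rng.uniform(size=(S, A))                 # cost (single constraint for simplicity)
tau = 0.5 * H                                # constraint threshold on cumulative cost
kappa = 0.05 * H                             # hypothetical conservatism margin: target tau - kappa

theta = np.zeros((S, A))                     # softmax policy logits (primal variable)
lam = 0.0                                    # Lagrange multiplier (dual variable)
eta_p, eta_d = 0.05, 0.01                    # hypothetical primal / dual step sizes

def rollout(theta):
    """Sample one episode; return the score-function gradient and cumulative reward/cost."""
    grad, R, C, s = np.zeros((S, A)), 0.0, 0.0, 0
    for _ in range(H):
        p = np.exp(theta[s] - theta[s].max()); p /= p.sum()
        a = rng.choice(A, p=p)
        grad[s] += np.eye(A)[a] - p          # d/dtheta log pi(a|s) for a softmax policy
        R += r[s, a]; C += c[s, a]
        s = rng.choice(S, p=P[s, a])
    return grad, R, C

for episode in range(2000):
    grad, R, C = rollout(theta)
    # Primal ascent on the Lagrangian E[R] - lam * E[C] (REINFORCE estimate).
    theta += eta_p * (R - lam * C) * grad
    # Dual descent: raise lam whenever the conservative budget tau - kappa is exceeded.
    lam = max(0.0, lam + eta_d * (C - (tau - kappa)))

print(f"final Lagrange multiplier: {lam:.3f}")
```

Tightening the constraint to $\tau - \kappa$ is what makes the update "conservative": the learned policy aims below the true budget, which is how zero violation of the original constraint can be targeted.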
Mean-Field Control (MFC) is an effective way to mitigate the curse of dimensionality in cooperative multi-agent reinforcement learning (MARL) problems. This work considers a collection of $N_{\mathrm{pop}}$ heterogeneous agents that can be segregated into $K$ classes such that the $k$-th class contains $N_k$ homogeneous agents. We aim to prove approximation guarantees for the MARL problem of this heterogeneous system via its corresponding MFC problem. We consider three scenarios in which the rewards and transition dynamics of all agents are respectively taken to be functions of $(1)$ the joint state and action distributions across all classes, $(2)$ the individual distributions of each class, and $(3)$ the marginal distributions of the entire population. We show that, in these cases, the $K$-class MARL problem can be approximated by MFC with errors given by $e_1 = \mathcal{O}\big(\frac{\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}}{N_{\mathrm{pop}}}\sum_{k}\sqrt{N_k}\big)$, $e_2 = \mathcal{O}\big(\big[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\big]\sum_{k}\frac{1}{\sqrt{N_k}}\big)$, and $e_3 = \mathcal{O}\Big(\big[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\big]\big[\frac{A}{N_{\mathrm{pop}}}\sum_{k\in[K]}\sqrt{N_k}+\frac{B}{\sqrt{N_{\mathrm{pop}}}}\big]\Big)$, respectively, where $A, B$ are some constants and $|\mathcal{X}|, |\mathcal{U}|$ are the sizes of the state and action spaces of each agent. Finally, we design a Natural Policy Gradient (NPG) based algorithm that, in the three cases stated above, can converge to the optimal MARL policy within $\mathcal{O}(e_j)$ error with a sample complexity of $\mathcal{O}(e_j^{-3})$, $j \in \{1,2,3\}$.
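As a rough illustration of the NPG component only, below is a single-agent tabular natural policy gradient sketch under softmax parameterization with exact policy evaluation. It is not the paper's multi-class MFC-based algorithm; the MDP, step size, and discount factor are hypothetical.

```python
# Minimal sketch of tabular NPG with a softmax policy, assuming a toy MDP.
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, eta = 4, 3, 0.9, 0.5
P = rng.dirichlet(np.ones(S), size=(S, A))   # toy transition kernel
r = rng.uniform(size=(S, A))                 # toy reward
theta = np.zeros((S, A))                     # softmax logits

def policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def q_values(pi):
    """Exact evaluation; a sample-based variant would estimate this from rollouts."""
    r_pi = (pi * r).sum(axis=1)
    P_pi = np.einsum('sax,sa->sx', P, pi)
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return r + gamma * P @ v

for it in range(100):
    pi = policy(theta)
    q = q_values(pi)
    adv = q - (pi * q).sum(axis=1, keepdims=True)
    # For softmax policies, the natural gradient step reduces to adding the advantage to the logits.
    theta += eta / (1 - gamma) * adv

print(np.round(policy(theta), 2))
```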
We consider the problem of constrained Markov decision processes (CMDPs), in which an agent interacts with a unichain Markov decision process. At every interaction, the agent obtains a reward. In addition, there are $K$ cost functions. The agent aims to maximize the long-term average reward while simultaneously keeping the $K$ long-term average costs below certain thresholds. In this paper, we propose CMDP-PSRL, a posterior-sampling-based algorithm with which the agent can learn optimal policies while interacting with the CMDP. Further, for an MDP with $S$ states, $A$ actions, and diameter $D$, we prove that, by following the CMDP-PSRL algorithm, the agent can bound the regret of not accumulating the reward of the optimal policy by $\tilde{O}(\mathrm{poly}(DSA)\sqrt{T})$. Further, we show that the violation of any of the $K$ constraints is also bounded by $\tilde{O}(\mathrm{poly}(DSA)\sqrt{T})$. To the best of our knowledge, this is among the first works to obtain $\tilde{O}(\sqrt{T})$ regret bounds for ergodic MDPs with long-term average constraints.
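The following is a minimal sketch of the posterior-sampling idea for an average-reward CMDP: maintain Dirichlet counts over transitions, sample an MDP from the posterior, solve the sampled constrained problem as a linear program over occupancy measures, and act with the resulting policy. It is a generic illustration rather than the exact CMDP-PSRL procedure; the environment, epoch length, and cost threshold `tau` are hypothetical.

```python
# Minimal posterior-sampling sketch for an average-reward CMDP, assuming a toy environment.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
S, A = 4, 2
P_true = rng.dirichlet(np.ones(S), size=(S, A))   # unknown true transitions
r = rng.uniform(size=(S, A))                      # known reward
c = rng.uniform(size=(S, A))                      # known cost (K = 1)
tau = 0.7                                         # hypothetical average-cost threshold
counts = np.ones((S, A, S))                       # Dirichlet prior over P(.|s,a)

def solve_cmdp(P):
    """Solve the sampled CMDP as an LP over occupancy measures mu(s,a)."""
    n = S * A
    c_obj = -r.reshape(n)                         # maximize average reward
    A_ub, b_ub = c.reshape(1, n), [tau]           # average cost <= tau
    A_eq = np.zeros((S + 1, n))                   # flow conservation + normalization
    for sp in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[sp, s * A + a] = P[s, a, sp] - (1.0 if s == sp else 0.0)
    A_eq[S, :] = 1.0
    b_eq = np.zeros(S + 1); b_eq[S] = 1.0
    res = linprog(c_obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    if res.status != 0:                           # fall back to uniform if infeasible
        return np.full((S, A), 1.0 / A)
    mu = res.x.reshape(S, A)
    row = mu.sum(axis=1, keepdims=True)
    return np.where(row > 1e-9, mu / np.maximum(row, 1e-12), 1.0 / A)

s = 0
for epoch in range(20):
    P_sample = np.array([[rng.dirichlet(counts[s_, a_]) for a_ in range(A)]
                         for s_ in range(S)])     # sample an MDP from the posterior
    pi = solve_cmdp(P_sample)
    for _ in range(200):                          # act with the sampled policy
        a = rng.choice(A, p=pi[s])
        s_next = rng.choice(S, p=P_true[s, a])
        counts[s, a, s_next] += 1                 # update the posterior counts
        s = s_next
```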
We introduce Argoverse 2 (AV2) - a collection of three datasets for perception and forecasting research in the self-driving domain. The annotated Sensor Dataset contains 1,000 sequences of multimodal data, encompassing high-resolution imagery from seven ring cameras, and two stereo cameras in addition to lidar point clouds, and 6-DOF map-aligned pose. Sequences contain 3D cuboid annotations for 26 object categories, all of which are sufficiently-sampled to support training and evaluation of 3D perception models. The Lidar Dataset contains 20,000 sequences of unlabeled lidar point clouds and map-aligned pose. This dataset is the largest ever collection of lidar sensor data and supports self-supervised learning and the emerging task of point cloud forecasting. Finally, the Motion Forecasting Dataset contains 250,000 scenarios mined for interesting and challenging interactions between the autonomous vehicle and other actors in each local scene. Models are tasked with the prediction of future motion for "scored actors" in each scenario and are provided with track histories that capture object location, heading, velocity, and category. In all three datasets, each scenario contains its own HD Map with 3D lane and crosswalk geometry - sourced from data captured in six distinct cities. We believe these datasets will support new and existing machine learning research problems in ways that existing datasets do not. All datasets are released under the CC BY-NC-SA 4.0 license.
Parkinson's disease is marked by altered and increased firing characteristics of pathological oscillations in the brain. In other words, it causes abnormal synchronous oscillations and suppression during neurological processing. In order to examine and regulate the synchronization and pathological oscillations in motor circuits, deep brain stimulators (DBS) are used. Although machine learning methods have been applied for the investigation of suppression, these models require large amounts of training data and computational power, both of which pose challenges to resource-constrained DBS. This research proposes a novel reinforcement learning (RL) framework for suppressing the synchronization in neuronal activity during episodes of neurological disorders with less power consumption. The proposed RL algorithm comprises an ensemble of a temporal representation of stimuli and a twin-delayed deep deterministic (TD3) policy gradient algorithm. We quantify the stability of the proposed framework to noise and reduced synchrony using RL for three pathological signaling regimes: regular, chaotic, and bursting, and further eliminate the undesirable oscillations. Furthermore, metrics such as evaluation rewards, energy supplied to the ensemble, and the mean point of convergence were used and compared to other RL algorithms, specifically Advantage Actor-Critic (A2C), Actor Critic using Kronecker-Factored Trust Region (ACKTR), and Proximal Policy Optimization (PPO).
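For reference, the core TD3 update (twin critics, target policy smoothing, delayed actor updates, Polyak-averaged targets) can be sketched as follows. This is a generic PyTorch sketch, not the paper's DBS framework; the network sizes, noise scales, and dimensions are hypothetical.

```python
# Minimal sketch of a single TD3 update step, assuming toy observation/action dimensions.
import copy
import torch
import torch.nn as nn

obs_dim, act_dim, act_limit = 8, 2, 1.0

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 64), nn.ReLU(), nn.Linear(64, out))

actor = nn.Sequential(mlp(obs_dim, act_dim), nn.Tanh())
critic1, critic2 = mlp(obs_dim + act_dim, 1), mlp(obs_dim + act_dim, 1)
actor_t, critic1_t, critic2_t = map(copy.deepcopy, (actor, critic1, critic2))

opt_a = torch.optim.Adam(actor.parameters(), lr=1e-3)
opt_c = torch.optim.Adam(list(critic1.parameters()) + list(critic2.parameters()), lr=1e-3)
gamma, tau, policy_delay, noise_std, noise_clip = 0.99, 0.005, 2, 0.2, 0.5

def td3_update(step, obs, act, rew, obs2, done):
    # Target action with clipped noise (target policy smoothing).
    with torch.no_grad():
        eps = (torch.randn_like(act) * noise_std).clamp(-noise_clip, noise_clip)
        act2 = (actor_t(obs2) + eps).clamp(-act_limit, act_limit)
        q_t = torch.min(critic1_t(torch.cat([obs2, act2], -1)),
                        critic2_t(torch.cat([obs2, act2], -1)))
        target = rew + gamma * (1 - done) * q_t
    # Twin-critic regression toward the shared (pessimistic) target.
    q1 = critic1(torch.cat([obs, act], -1))
    q2 = critic2(torch.cat([obs, act], -1))
    loss_c = ((q1 - target) ** 2 + (q2 - target) ** 2).mean()
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    # Delayed actor update and Polyak averaging of the target networks.
    if step % policy_delay == 0:
        loss_a = -critic1(torch.cat([obs, actor(obs)], -1)).mean()
        opt_a.zero_grad(); loss_a.backward(); opt_a.step()
        for net, net_t in ((actor, actor_t), (critic1, critic1_t), (critic2, critic2_t)):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)

# Example call on a random mini-batch of transitions.
B = 32
td3_update(0, torch.randn(B, obs_dim), torch.rand(B, act_dim) * 2 - 1,
           torch.rand(B, 1), torch.randn(B, obs_dim), torch.zeros(B, 1))
```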
Problem statement: Standardisation of AI fairness rules and benchmarks is challenging because AI fairness and other ethical requirements depend on multiple factors such as context, use case, type of the AI system, and so on. In this paper, we elaborate that the AI system is prone to biases at every stage of its lifecycle, from inception to its usage, and that all stages require due attention for mitigating AI bias. We need a standardised approach to handle AI fairness at every stage. Gap analysis: While AI fairness is a hot research topic, a holistic strategy for AI fairness is generally missing. Most researchers focus only on a few facets of AI model-building. Peer review shows excessive focus on biases in the datasets, fairness metrics, and algorithmic bias. In the process, other aspects affecting AI fairness get ignored. The solution proposed: We propose a comprehensive approach in the form of a novel seven-layer model, inspired by the Open System Interconnection (OSI) model, to standardise AI fairness handling. Despite the differences in the various aspects, most AI systems have similar model-building stages. The proposed model splits the AI system lifecycle into seven abstraction layers, each corresponding to a well-defined AI model-building or usage stage. We also provide checklists for each layer and deliberate on potential sources of bias in each layer and their mitigation methodologies. This work will facilitate layer-wise standardisation of AI fairness rules and benchmarking parameters.
Supervised approaches generally rely on majority-based labels. However, it is hard to achieve high agreement among annotators in subjective tasks such as hate speech detection. Existing neural network models principally regard labels as categorical variables, while ignoring the semantic information in diverse label texts. In this paper, we propose AnnoBERT, a first-of-its-kind architecture that integrates annotator characteristics and label text with a transformer-based model to detect hate speech; it builds a unique representation for each annotator via Collaborative Topic Regression (CTR) and integrates label text to enrich the textual representations. During training, the model associates annotators with their label choices given a piece of text; during evaluation, when label information is not available, the model predicts the aggregated label given by the participating annotators by utilising the learnt association. The proposed approach displayed an advantage in detecting hate speech, especially in the minority class and edge cases with annotator disagreement. Improvement in the overall performance is the largest when the dataset is more label-imbalanced, suggesting its practical value in identifying real-world hate speech, as the volume of hate speech in the wild on social media is extremely small compared with normal (non-hate) speech. Through ablation studies, we show the relative contributions of annotator embeddings and label text to the model performance, and we test a range of alternative annotator embedding and label text combinations.
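A minimal, hypothetical sketch of the general idea, combining a pooled text representation with per-annotator and label-text embeddings, is given below. It is not the actual AnnoBERT implementation (which builds annotator representations with CTR on top of a pretrained transformer); all dimensions, vocabulary sizes, and module choices are invented stand-ins.

```python
# Minimal sketch, assuming invented vocab/annotator/label sizes and a pooled-embedding
# text encoder as a stand-in for a pretrained transformer.
import torch
import torch.nn as nn

class AnnotatorAwareClassifier(nn.Module):
    def __init__(self, vocab=1000, n_annotators=50, n_labels=2, dim=64):
        super().__init__()
        self.tok = nn.EmbeddingBag(vocab, dim)        # stand-in for a transformer encoder
        self.annotator = nn.Embedding(n_annotators, dim)
        self.label_text = nn.Embedding(n_labels, dim) # embeddings of the label descriptions
        self.scorer = nn.Linear(dim, 1)

    def forward(self, token_ids, annotator_ids):
        text = self.tok(token_ids)                    # (B, dim) pooled text representation
        combo = text + self.annotator(annotator_ids)  # condition the text on the annotator
        # Score every label by matching the conditioned text against label-text embeddings.
        logits = self.scorer(combo.unsqueeze(1) * self.label_text.weight).squeeze(-1)
        return logits                                 # (B, n_labels)

model = AnnotatorAwareClassifier()
tokens = torch.randint(0, 1000, (4, 20))              # a batch of 4 token-id sequences
annotators = torch.randint(0, 50, (4,))
labels = torch.randint(0, 2, (4,))                    # each annotator's own label choice
loss = nn.functional.cross_entropy(model(tokens, annotators), labels)
loss.backward()
```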
Dense retrievers have made significant strides in obtaining state-of-the-art results on text retrieval and open-domain question answering (ODQA). Yet most of these achievements were made possible with the help of large annotated datasets; unsupervised learning for dense retrieval models remains an open problem. In this work, we explore two categories of methods for creating pseudo query-document pairs, named query extraction (QExt) and transferred query generation (TQGen), to augment the retriever training in an annotation-free and scalable manner. Specifically, QExt extracts pseudo queries by document structures or selecting salient random spans, and TQGen utilizes generation models trained for other NLP tasks (e.g., summarization) to produce pseudo queries. Extensive experiments show that dense retrievers trained with individual augmentation methods can perform comparably well with multiple strong baselines, and combining them leads to further improvements, achieving state-of-the-art performance of unsupervised dense retrieval on both BEIR and ODQA datasets.
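A minimal sketch of the QExt idea of random-span selection is shown below: pseudo queries are short spans sampled from a document and paired with that document as positive training examples. The span lengths and the crude "salience" filter are invented stand-ins, not the paper's exact recipe.

```python
# Minimal sketch of random-span pseudo-query extraction, assuming invented span
# lengths and a simple salience heuristic.
import random

def extract_pseudo_queries(document: str, n_queries: int = 3,
                           min_len: int = 5, max_len: int = 12) -> list[tuple[str, str]]:
    """Return (pseudo_query, document) pairs built from randomly selected spans."""
    tokens = document.split()
    pairs = []
    for _ in range(n_queries):
        if len(tokens) <= min_len:
            break
        span_len = random.randint(min_len, min(max_len, len(tokens)))
        start = random.randint(0, len(tokens) - span_len)
        span = tokens[start:start + span_len]
        # Crude salience filter: keep spans that contain at least one "long" word.
        if any(len(w) > 6 for w in span):
            pairs.append((" ".join(span), document))
    return pairs

doc = ("Dense retrievers encode queries and documents into a shared vector space "
       "and rank documents by similarity to the query embedding.")
for q, _ in extract_pseudo_queries(doc):
    print("pseudo query:", q)
```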
Active target sensing is the task of discovering and classifying an unknown number of targets in an environment and is critical in search-and-rescue missions. This paper develops a deep reinforcement learning approach to plan informative trajectories that increase the likelihood for an uncrewed aerial vehicle (UAV) to discover missing targets. Our approach efficiently (1) explores the environment to discover new targets, (2) exploits its current belief of the target states and incorporates inaccurate sensor models for high-fidelity classification, and (3) generates dynamically feasible trajectories for an agile UAV by employing a motion primitive library. Extensive simulations on randomly generated environments show that our approach is more efficient in discovering and classifying targets than several other baselines. A unique characteristic of our approach, in contrast to heuristic informative path planning approaches, is that it is robust to varying amounts of deviations of the prior belief from the true target distribution, thereby alleviating the challenge of designing heuristics specific to the application conditions.
We study the sample complexity of reducing reinforcement learning to a sequence of empirical risk minimization problems over the policy space. Such reductions-based algorithms exhibit local convergence in the function space, as opposed to the parameter space for policy gradient algorithms, and thus are unaffected by the possibly non-linear or discontinuous parameterization of the policy class. We propose a variance-reduced variant of Conservative Policy Iteration that improves the sample complexity of producing an $\varepsilon$-functional local optimum from $O(\varepsilon^{-4})$ to $O(\varepsilon^{-3})$. Under state-coverage and policy-completeness assumptions, the algorithm enjoys $\varepsilon$-global optimality after sampling $O(\varepsilon^{-2})$ times, improving upon the previously established $O(\varepsilon^{-3})$ sample requirement.
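For context, the classical Conservative Policy Iteration update that underlies the variance-reduced variant mixes the current policy with a greedy policy using a small step. Below is a minimal tabular sketch with exact evaluation; the MDP and the mixture step `alpha` are hypothetical, and the paper's algorithm instead works from samples via empirical risk minimization.

```python
# Minimal tabular sketch of the conservative mixture update, assuming a toy MDP
# and exact policy evaluation in place of sample-based estimates.
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma, alpha = 5, 3, 0.9, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))    # toy transition kernel
r = rng.uniform(size=(S, A))                  # toy reward
pi = np.full((S, A), 1.0 / A)                 # start from the uniform policy

def q_values(pi):
    """Exact evaluation; in practice this would be estimated by regression from rollouts."""
    r_pi = (pi * r).sum(axis=1)
    P_pi = np.einsum('sax,sa->sx', P, pi)
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return r + gamma * P @ v

for it in range(200):
    q = q_values(pi)
    greedy = np.eye(A)[q.argmax(axis=1)]      # deterministic greedy policy w.r.t. Q
    pi = (1 - alpha) * pi + alpha * greedy    # conservative mixture update

print(np.round(pi, 2))
```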